Data Visualization: ggplot2 tutorial using gapminder dataset

If I can’t picture it, I can’t understand it. Albert Einstein

Overview

In this tutorial we look at data on the wealth and life expectancy of countries over time, popularized by Hans Rosling and known as gapminder. The goal is to give an overview of how to graph a variable depending on its type, to introduce some simple 1D and 2D plots built with ggplot2, and to outline the layered grammar of graphics on which ggplot2 is built.

Learning objectives

  • Generate plots from data according to their type (discrete, continuous …)
  • Manage plot settings
  • Produce plots from data in a data frame
  • Modify and customize a plot
  • Create complex and fancy plots

Loading/installing packages

library(ggplot2) #for plotting
library(dplyr)    #for data manipulation
library(scales)  #for graphical scales
library(gapminder) #for dataset
library(plotly) # adds a frame aesthetic to ggplot, and allows interactive, linked views of a series of frames over time

Let’s have a look at the structure of our data:

str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

The print() method gives an abbreviated printout.

gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

It is useful to get some overview of the variables before getting started.

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

We will want to look at trends over time by continent. How many countries are in this data set in each continent? There are 12 years for each country. Are the data complete? table() gives an answer.

table(gapminder$continent, gapminder$year)
##           
##            1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
##   Africa     52   52   52   52   52   52   52   52   52   52   52   52
##   Americas   25   25   25   25   25   25   25   25   25   25   25   25
##   Asia       33   33   33   33   33   33   33   33   33   33   33   33
##   Europe     30   30   30   30   30   30   30   30   30   30   30   30
##   Oceania     2    2    2    2    2    2    2    2    2    2    2    2

Note: we used the $ symbol with the data$variable notation because table() doesn’t have a data= argument. Another way to do this is to use the with() function, which makes the variables in a data set available directly. The same table can be obtained using:

with(gapminder, {table(continent, year)})
##           year
## continent  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
##   Africa     52   52   52   52   52   52   52   52   52   52   52   52
##   Americas   25   25   25   25   25   25   25   25   25   25   25   25
##   Asia       33   33   33   33   33   33   33   33   33   33   33   33
##   Europe     30   30   30   30   30   30   30   30   30   30   30   30
##   Oceania     2    2    2    2    2    2    2    2    2    2    2    2
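
The tables above answer the question about the number of countries indirectly, since every column repeats the same counts. As a minimal sketch, assuming dplyr is loaded as above, the number of distinct countries per continent can also be computed directly (the name countries_per_continent is only illustrative):

```r
library(dplyr)
library(gapminder)

# Count the distinct countries in each continent (one row per continent)
countries_per_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(n_countries = n_distinct(country))

countries_per_continent
```

The counts should match any single column of the table() output.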

1D plots: Bar plots for discrete variables

As we saw during the lecture, the distribution of a categorical variable, such as continent, is best visualized using a bar plot. With ggplot2, this is relatively easy:

  • we start by mapping the x variable to continent
  • then, we add a geom_bar() layer, that counts the observations in each category and plots them as bar lengths.
ggplot(gapminder, aes(x=continent)) + geom_bar()

To make this more colorful, you can also map the fill attribute to continent.

ggplot(gapminder, aes(x=continent, fill=continent)) + geom_bar()

With ggplot2 features, we will be able also to:

  • change the default color schemes
  • modify labels
  • change the legend position, or eliminate it in some cases
  • flip the axes …

Let’s try some!

  • We will change the y axis in geom_bar() from the default count to after_stat(count)/12, so that it represents the number of countries (each country has 12 observations, one per year).
  • Change the y-axis label to a more meaningful one: Number of countries
  • Suppress the default legend for continent, which is redundant in this case
ggplot(gapminder, aes(x=continent, fill=continent)) + 
    geom_bar(aes(y=after_stat(count)/12)) +   #after_stat() replaces the deprecated ..count.. notation
    labs(y="Number of countries") +
    guides(fill="none")                       #fill=FALSE is deprecated
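
The feature list above also mentions changing the default color scheme and moving the legend, which the example does not cover. A minimal sketch of both, assuming the ColorBrewer palette "Set2" is an acceptable choice (the palette and the name bar_brewer are only illustrative):

```r
library(ggplot2)
library(gapminder)

# Same bar plot, with a ColorBrewer palette and the legend moved below the plot
bar_brewer <- ggplot(gapminder, aes(x = continent, fill = continent)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2") +   # replace the default fill colors
  labs(x = "Continent", y = "Number of observations") +
  theme(legend.position = "bottom")       # or "none" to drop the legend

bar_brewer
```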

Note: Every plot in ggplot2 is a ggplot object.

If you want to save a given plot for future use, store it in a variable by using: mybar <- ggplot() + ...

mybar <- ggplot(gapminder, aes(x=continent, fill=continent)) + 
    geom_bar(aes(y=after_stat(count)/12)) +
    labs(y="Number of countries") +
    guides(fill="none")
mybar
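
Because a plot is an ordinary R object, it can be inspected and written to a file as well as printed. The sketch below uses a simplified plot and a temporary PDF file purely for illustration; the names p and out_file are assumptions, not part of the tutorial.

```r
library(ggplot2)
library(gapminder)

# Store a plot in a variable; nothing is drawn until it is printed
p <- ggplot(gapminder, aes(x = continent)) + geom_bar()

inherits(p, "ggplot")   # check that p is a ggplot object

# Write the stored plot to a file (a temporary file, purely for illustration)
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
file.exists(out_file)
```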

Some other ggplot2 features

  • Transforming coordinates using coord_trans function
mybar + coord_trans(y="sqrt")

  • Flipping axes using coord_flip function
mybar + coord_flip()

  • Transform to polar coordinates
mybar + coord_polar()

1D plots: density plots for continuous variables

The gapminder data set contains several continuous variables: life expectancy (lifeExp), population (pop) and gross domestic product per capita (gdpPercap) for each year and country. For such variables, density plots provide a useful graphical summary.

Let’s start by exploring life expectancy. The simplest plot uses this as the horizontal axis, aes(x=lifeExp) and then adds geom_density() to calculate and plot the smoothed frequency distribution.

ggplot(data=gapminder, aes(x=lifeExp)) + 
    geom_density()
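
To see what the smoothing does, it can help to draw the density on top of a histogram of the same variable. This is a sketch only; the bin width of 2 years and the name hist_dens are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

# Histogram rescaled to density units, with the kernel density estimate on top
hist_dens <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 2, fill = "grey80", color = "white") +
  geom_density()

hist_dens
```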

Several options make this plot prettier: change the line thickness (linewidth=; size= in ggplot2 versions before 3.4.0), add a fill color (fill=), and make the fill partially transparent (alpha=).

ggplot(data=gapminder, aes(x=lifeExp)) + 
    geom_density(linewidth=1.5, fill="pink", alpha=0.3)

Differences by continent

The density of lifeExp is bimodal, and the reason is not obvious from this plot alone. Adding another aesthetic, fill=continent, which geom_density() inherits, shows how the distribution differs between continents.

ggplot(data=gapminder, aes(x=lifeExp, fill=continent)) +
    geom_density(alpha=0.3)

Note 1: We used transparent colors (alpha=) to see the different distributions across continents more clearly.

Note 2: It is easy now to see that African countries differ markedly from the rest.

Boxplots and other visual summaries

You might want to visualize the distributions of life expectancy by another visual summary, grouped by continent. All you need to do is change the aesthetic to show continent on one axis, and life expectancy (lifeExp) on the other.

gap1 <- ggplot(data=gapminder, aes(x=continent, y=lifeExp, fill=continent))

Then, add a geom_boxplot() layer:

gap1 +
    geom_boxplot(outlier.size=2)
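
The box in each boxplot spans the first to third quartile, with a line at the median. As a rough numeric counterpart to the plot, the same quantities can be computed with dplyr (the name lifeexp_summary is only illustrative):

```r
library(dplyr)
library(gapminder)

# Quartiles of life expectancy per continent: the values the boxes are drawn from
lifeexp_summary <- gapminder %>%
  group_by(continent) %>%
  summarise(
    q1     = quantile(lifeExp, 0.25),
    median = median(lifeExp),
    q3     = quantile(lifeExp, 0.75)
  )

lifeexp_summary
```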

Challenge 1

  • Remove the legend from this plot
  • Make the plot horizontal
  • Instead of a boxplot, try geom_violin()

Effect ordering

The continents are a factor, and are ordered alphabetically by default. It might be more useful to order them by the mean or median life expectancy.

In this example, I use the dplyr “pipe” notation (%>%) to send the gapminder data to the dplyr::mutate() function and, within that, reorder() the continents by their median life expectancy.

gapminder %>% 
    mutate(continent = reorder(continent, lifeExp, FUN=median))
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Note: In other situations, you could use FUN=mean, FUN=sd, or FUN=max to sort the levels by their means, standard deviations, maximums, or any other function.
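
The tibble printed above looks identical to the original data because reorder() changes only the order of the factor levels, not the values. A small sketch to make the change visible (the name reordered is only illustrative):

```r
library(gapminder)

# Levels before reordering: alphabetical
levels(gapminder$continent)

# Levels after reordering by median life expectancy (lowest median first)
reordered <- reorder(gapminder$continent, gapminder$lifeExp, FUN = median)
levels(reordered)

# The medians that determine the new order
sort(tapply(gapminder$lifeExp, gapminder$continent, median))
```

It is this level order that ggplot2 uses to arrange the continents along the axis.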

We can now pipe the result of this right into ggplot:

gapminder %>% 
    mutate(continent = reorder(continent, lifeExp, FUN=median)) %>%
    ggplot(aes(x=continent, y=lifeExp, fill=continent)) +
    geom_boxplot(outlier.size=2)

Exploring GDP

Let’s look at the distribution of gdpPercap in a similar way, starting with the unconditional distribution.

ggplot(data=gapminder, aes(x=gdpPercap)) + 
    geom_density() 

Challenge 2

  • As we did for lifeExp, plot the distributions separately for each continent
  • It is probably more useful to plot GDP on a log scale. Add another layer that transforms the x axis to log10(gdpPercap).
  • Make boxplots of gdpPercap by continent.
  • Do the same, but plot GDP on a log scale.

1.5D: Layers & Time series plots

Layers

Let’s explore how life expectancy changes with GDP per capita for a single country, for example China. We can use geom_line() to make a line plot.

china <- ggplot(subset(gapminder, country =="China"), #subsetting data
            aes(x=gdpPercap, y=lifeExp))
china + geom_line() 
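
One detail worth knowing: geom_line() connects the observations in order of the x variable, not in order of time. In these data China's GDP per capita does not rise in every period, so the line does not strictly follow the years. As a sketch of an alternative, geom_path() connects points in the order they appear in the data (the names china_df and china_path are only illustrative):

```r
library(ggplot2)
library(gapminder)

china_df <- subset(gapminder, country == "China")
china_df <- china_df[order(china_df$year), ]   # make the row order chronological

# geom_path() joins points in row order (here: by year),
# whereas geom_line() joins them in order of the x variable
china_path <- ggplot(china_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_path() +
  geom_point()

china_path
```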

We can use both geom_line and geom_point to make a line plot with points at the data values.

china + geom_line() + geom_point()  #adding points to data values 

Note: This brings up another important concept with ggplot2: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.

If we make the lines and points different colors, we can see that points are placed on top of the lines, since they are in the second layer.

china + geom_line(color="lightblue") + geom_point(color="violetred") #adding some colors 

If we switch the order of geom_point() and geom_line(), we’ll reverse the layers.

china + geom_point(color="violetred") + geom_line(color="lightblue") 

Note: aesthetics that are included in the call to ggplot() (or added completely separately with + aes()) become the defaults for all layers, but we can also control the aesthetics of each layer separately. For example, we could color the points by year:

china + geom_line() + geom_point(aes(color=year)) #color the point by year

With a rainbow:

china + geom_line() + geom_point(aes(color=year))+ scale_color_gradientn(colours = rainbow(5)) #with a rainbow

Coloring both points and lines:

china + geom_line() + geom_point() + aes(color=year) #coloring both point and line

china + geom_line() + geom_point() + aes(color=year)+ scale_color_gradientn(colours = rainbow(5)) #both with rainbow shade

Challenge 3

  • Make a plot of lifeExp vs gdpPercap for China and India, with both lines and points.

Time series plot

Let’s explore how life expectancy has changed over time. The simplest way is to plot a line for each country over the years. To do this, we use the group aesthetic.

ggplot(gapminder, aes(x=year, y=lifeExp, group=country)) + #using the aesthetic group
  geom_line()

Adding colors:

ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #adding color
  geom_line()

Making the lines semi-transparent:

ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #semi-transparent lines to reduce overplotting
  geom_line(alpha = 0.5)

Plotting a summary

A better way to look at trends over time is to compute the mean or median for each year and continent and plot those.

gapminder %>%
  group_by(continent, year) %>%
  summarise(lifeExp=median(lifeExp)) %>% head() #median for each year and continent
## # A tibble: 6 × 3
## # Groups:   continent [1]
##   continent  year lifeExp
##   <fct>     <int>   <dbl>
## 1 Africa     1952    38.8
## 2 Africa     1957    40.6
## 3 Africa     1962    42.6
## 4 Africa     1967    44.7
## 5 Africa     1972    47.0
## 6 Africa     1977    49.3

One nice feature of the dplyr and tidyverse framework is that you can pipe the result of such a summary directly to ggplot():

gapminder %>% #piping to ggplot
  group_by(continent, year) %>%
  summarise(lifeExp=median(lifeExp)) %>%
  ggplot(aes(x=year, y=lifeExp, color=continent)) +
  geom_line(linewidth=1) + #linewidth= replaces size= for lines from ggplot2 3.4.0
  geom_point(size=1.5)

If you want to make several plots of such a summarized data set, save the result in a new object.

gapminder %>% #saving the summary in a new object using assignment
  group_by(continent, year) %>%
  summarise(lifeExp=median(lifeExp)) -> gapyear

Let’s play with our plot and make it fancier!

We can fit linear regression lines for each continent instead of joining all the points:

ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) +  #fitting linear regression lines for each continent
  geom_point(size=1.5) +
  geom_smooth(aes(fill=continent), method="lm")
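
The regression lines drawn by geom_smooth(method="lm") are ordinary least-squares fits of lifeExp on year within each continent. As a sketch, the slopes can be recovered with lm() directly; the summary is recomputed here so the block is self-contained, and the names gap_median and slopes are only illustrative.

```r
library(dplyr)
library(gapminder)

# Median life expectancy per continent and year (same summary as gapyear)
gap_median <- gapminder %>%
  group_by(continent, year) %>%
  summarise(lifeExp = median(lifeExp), .groups = "drop")

# Slope of lifeExp ~ year per continent:
# years of median life expectancy gained per calendar year
slopes <- sapply(split(gap_median, gap_median$continent),
                 function(d) coef(lm(lifeExp ~ year, data = d))[["year"]])
slopes
```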

We can also use a loess smooth rather than a linear regression:

ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) +  #using a loess smooth
  geom_point(size=1.5) +
  geom_smooth(aes(fill=continent), method="loess")

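We can override the default legend placement and put the legend inside the plot (from ggplot2 3.5.0, a numeric legend.position is deprecated in favor of legend.position = "inside" together with legend.position.inside):

ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) +  #loess smooth with the legend inside the plot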
  geom_point(size=1.5) +
  theme(                
    legend.position = c(0.99, 0.03),
    legend.justification = c("right", "bottom") #placing the legend inside the plot
  )+
  geom_smooth(aes(fill=continent), method="loess")

2D: Scatterplots

Let’s explore the relationship between life expectancy and GDP with a scatterplot.

A basic scatterplot is set up by assigning two variables to the x and y aesthetic attributes; we can then add the points in another layer.

plt <- ggplot(data=gapminder,
              aes(x=gdpPercap, y=lifeExp))
plt + geom_point()

Or, color them by continent.

plt + geom_point(aes(color=continent)) #adding color by continent

For a better look, we can also add a smoothed curve for all the data:

plt + geom_point(aes(color=continent)) +
  geom_smooth(method="loess")  #adding a smoothed curve for all the data

As we saw earlier, GDP is better plotted on a log scale:

plt + geom_point(aes(color=continent)) +
  geom_smooth(method="loess") +
  scale_x_log10()     #plotting on a log scale
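
One way to check numerically that the log scale straightens the relationship is to compare the linear correlation of life expectancy with raw and with log-transformed GDP per capita. This is only a rough diagnostic, and the names cor_raw and cor_log are illustrative:

```r
library(gapminder)

# Linear correlation of life expectancy with GDP per capita, raw vs log10
cor_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)
cor_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)
c(raw = cor_raw, log10 = cor_log)
```

A higher correlation on the log scale suggests the relationship is closer to linear after the transformation.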

Customizing the plot

The last plot, on the log scale, has ugly axis labels. Let’s adjust the scale:

plt + geom_point(aes(color=continent)) +
  geom_smooth(method="loess") +
  scale_x_log10(labels=scales::comma)    #adjusting scale
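
The labels= argument accepts a function that turns break values into text. To see what scales::comma does on its own, it can be called on a few typical break values (the name labels_example is only illustrative):

```r
library(scales)

# comma() formats numbers with a thousands separator instead of scientific notation
labels_example <- comma(c(1000, 10000, 100000))
labels_example
```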

Moving the legends inside the plot:

plt + geom_point(aes(color=continent)) +
  geom_smooth(method="loess") +
  scale_x_log10(labels=scales::comma) +
  theme(legend.position = c(0.8, 0.2)) # putting the legend inside the plot

Changing the theme:

plt + geom_point(aes(color=continent)) +
  geom_smooth(method="loess") +
  scale_x_log10(labels=scales::comma) +
  theme_bw()   #changing the theme of the plot

Replacing the single loess smoothed curve with a separate regression line for each continent:

plt + geom_point(aes(color=continent)) +
  geom_smooth(aes(fill= continent) , method="lm") +
  scale_x_log10(labels=scales::comma) +
  theme_bw()     #smoothing by a regression line for each continent

Making a “bubble” plot, mapping the size of each point to population (pop):

plt + geom_point(aes(size = pop, color=continent)) +  #making a bubble plot by mapping the size of each point to population
  geom_smooth(method="lm") +
  scale_x_log10(labels=scales::comma) +
  theme_bw()

Making the points semi-transparent:

plt + geom_point(aes(size = pop, color=continent), alpha = 0.5) +  #semi-transparent points to reduce overplotting
  geom_smooth(method="lm") +
  scale_x_log10(labels=scales::comma) +
  theme_bw()

Let’s explore life expectancy by continent for a given year. To do that, we need to filter our data.

gm_2007 <- subset(gapminder, year==2007) #filtering data by picking those of 2007 
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_point()

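Many points overlap within each continent, so we jitter them horizontally:

ggplot(gm_2007, aes(y=lifeExp, x=continent)) +
  geom_point(position=position_jitter(width=0.1, height=0)) #jittering points horizontally to reduce overplotting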
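
Jittering adds random noise, so the plot looks slightly different each time it is drawn. A sketch of a reproducible version, using the seed argument of position_jitter() (the seed value 42 and the name jittered are arbitrary):

```r
library(ggplot2)
library(gapminder)

gm_2007 <- subset(gapminder, year == 2007)

# A fixed seed makes the random horizontal offsets identical on every run
jittered <- ggplot(gm_2007, aes(x = continent, y = lifeExp)) +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 42))

jittered
```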

Advanced customized and fancy plots

Bubble plot

Exploring GDP versus life expectancy in 2007, highlighting the largest countries. To label only those countries, we filter the data passed to geom_text().

ggplot(gm_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop),# add scatter points
             alpha = 0.5) +
  geom_text(aes(x = gdpPercap, y = lifeExp + 3, label = country), # add some text annotations for the very large countries
            color = "grey50",
            data = filter(gm_2007, pop > 1000000000 | country %in% c("Nigeria", "United States"))) +
  scale_x_log10(limits = c(200, 60000)) + # log scale on the x axis with fixed limits
  labs(title = "GDP versus life expectancy in 2007", # change labels
       x = "GDP per capita (log scale)",
       y = "Life expectancy",
       size = "Population",
       color = "Continent") +
  scale_size(range = c(0.1, 10), # change the size scale
             guide = "none") + # remove size legend
  theme_classic() +   # add a nicer theme
  theme(legend.position = "top",  # place legend at top and grey axis lines
        axis.line = element_line(color = "grey85"),
        axis.ticks = element_line(color = "grey85"))
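
To check which countries receive a text label, the filter passed to geom_text() can be run on its own. This sketch repeats the same condition; the name labelled is only illustrative.

```r
library(dplyr)
library(gapminder)

gm_2007 <- subset(gapminder, year == 2007)

# Rows that receive a text label in the bubble plot
labelled <- dplyr::filter(
  gm_2007,
  pop > 1000000000 | country %in% c("Nigeria", "United States")
)
as.character(labelled$country)
```

Only countries with more than one billion inhabitants, plus the two named explicitly, are annotated; the other points remain unlabelled.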